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Abstract 

We investigate a neutral model for speciation and extinction, the constant 
rate birth-death process. The process is conditioned to have n extant species 
today, we look at the tree distribution of the reconstructed trees- i.e. the trees 
without the extinct species. Whereas the tree shape distribution is well-known 
and actually the same as under the pure birth process, no analytic results for the 
speciation times were known. We provide the distribution for the speciation times 
and calculate the expectations analytically. This characterizes the reconstructed 
trees completely. We will show how the results can be used to date phylogenies. 

Keywords: Phylogenetics, macroevolutionary models, birth-death process, recon- 
structed process. 

1 Introduction 

Phylogenetics is the science of reconstructing the evolutionary history of lineages (usu- 
ally species) . Besides providing data for systematics and for taxonomy, phylogenies are 
the pattern of past diversification and so can be analysed to infer past macroevolution- 



ary process. The first common step is to comp are the reconstructed trees with expec 



tations from neutral models of diversification (iGould et al. 



1977 



Mooers and Heard 



1997 : 



Nee et al. 



1992 



Raup et al.l . I1973l) . The simplest class of neutral model are en- 



tirely homogeneous, and assume that throughout time, whenever a speciation (or ex- 
tinction) event occurs, each species is equally likely to be the one undergoing that event. 
Of course speciation is not just random - lineages will differ in their expec ted diversi- 



fication rates for both instrinsic and extrinsic factors (IMooers et al. 



20071 ). However, 



a neutral model is often used as a null model t o analyze the data, with departures 



pointing the way to more sophisticated scenarios (IHarvev et al. 



994) 



1968 



Kendalll . 



19481) 



We investigate the constant rate birth-death process (iFeller 
as it is probably the most popular homogeneous model. A birth-death process is a 
stochastic process which starts with an initial species. A species gives birth to a new 
species after exponential (rate A) waiting times and dies after an exponential (rate 
/i) waiting time. Throughout this paper, we will have < ;U < A. In the following, 
time is today and tor the origin of the tree, so time is increas ing going i nto the 



past. Special cases of the birth-death proc ess are the Yule model flYu. 



fi = and the critical branching process ( lAldous and Popovic 



2005 



1924 



Popovic 



where 



2004h 



where = A. When looking at phylogenies, we have a given number, say n, of extant 
taxa. We therefore condition the process to have n species today, we call that process 
the conditioned birth-death process (cBDP). The age of the tree, i.e. the time since 
origin of the birth-death process is tor] if tor is not known , we assume a unifor r n prio r 



on (0, oo ' 



for the time of origin as it has been done in 



Aldous and Popovid (l2005l ): 



Popovid (120041 ). Note that a tree which evolved under a birth-death process includes 
extinct species, it is called the complete tree. From the complete tree, delete the extinct 
lineages. This is the reconstructed tree shape, see Figure [TJ Label its leaves uniformly 
at random (since each species evolves in the same way). The resulting tree is called the 
reconstructed tree. Note that when reconstructing a phylogeny from (molecular) data. 
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we see the reconstructed tree. Extinct lineages are only apparent when the fossil record 
is included. 



Figure 1: A complete tree (left) and its reconstructed tree shape (right). 



In iNee et al.l ( 1l994j ). the reconstructed tree of a birth-death process after time t is 
discussed. However, in this paper we additionally condition on having n extant species, 
since this allows us to compare the model with phylogenies on n extant species. We 
will obtain the probability for each speciation event in a tree with n species. This has 
be en done for the Yule model and the conditioned critical branching process (cCBP) 



m 



GernhardI (120071 ) (note that the conditioned critical branching process is the critical 
branching process conditioned on n extant species). For the general birth-death process, 
the joint probabili t y for the shape and all speciation times has been established in 



Rannala and Yang (119961 ): the joi nt probability for th e spec iation times disregarding 



Yang and Rannalal (119971 ). However, no individual 



the shape has been established in 
probabilities have been established. 

For establishing the individual probabilities, we introduce the point process represen- 
tation for reconstructed trees (Section [2 ]) . This had bee n done for the critical branching 



process m 



Aldous and Popovid (120051 ): 



Popovid (120041 ). In Section [3], we calculate the 



probability distribution of the age of a given tree on n species, assuming a uniform 
prior on (0, oo) for the age of the tree. This enables us to derive the density function 
for the time of the k-th spceciation event in a tree with n extant species (Section H]) 



and its expectation (Section - assuming a uniform prior or conditioning on the age 
of the tree. In Section [6], we discuss some further properties of reconstructed trees. We 
will determine the point process when not conditioning the cBDP on the time of origin. 
Also, we describe the point process of the coalescent, the neutral model in population 
genetics. Further, we will discuss the backwards process of reconstructed trees. The 
backward process is the process of the coalescence of the extant species. 

Knowing the time of the k-th speciation event in a reconstructed tree with 
n sp ecies allows us t o cal c ulate the t ime o f a given vertex in the reconstructed 



tree (iGernhard et al 



200 



Gernhard 



20071 ). This becomes useful for dating phy- 



logenies. If we are able to reconstruct the phylogeny of extant species, but do 
not obtain speciation times, we can use the expected time of a speciation event 
as an estimate for the speciation time. Thi s estim ate has been used for the un- 



dated vertices in the primate phylogeny (jVos 



20061 ). assuming the Yule model. Sim- 



ulations were used for obtaining the expected speciation times. We provide ana- 
lytic results assuming any constant rate birth-death model. The methods are im- 
plemented in python as part of our PhyloTree package and can be downloaded at 



http : / / www-m9 . ma. tum. de / twiki / pub / AUgemeines / Tanj aGernhard / PhyloTree . zip 



In mathematical terms, a reconstructed tree is a rooted, binary tree with unique 
leaf labels and ultrametric edge lengths assigned, i.e. the distance from any leaf to 
the root is the same, see Figure Ej left tree. We denote the set of interior leaves by 
V. A ranked reconstructed tree is a reconstructed tree without edge lengths but with 
a rank function defined on the interior leaves. A rank function is a bijection from 
V {1, 2, . . . , |V^|} where the ranks are increasing on any path from the root to the 
leaves. A (ranked) reconstructed tree shape is a (ranked) reconstructed tree without leaf 
labels. A (ranked) oriented tree is a (ranked) reconstructed tree without leaf labels but 



4 



where we distinguish between the two daughter edges of the interior vertices, w.Lo.g. 
label them / and r, see Figure [2|, middle tree. Note that a (ranked) oriented tree has n\ 
possible labelings. We introduce the oriented tree to make the proofs clearer and the 
statements easier. 

Remark 1.1. The cBDP induces a (ranked) reconstructed tree in the following way. 
Consider the complete tree which evolved under the cBDP. We delete the extinct lin- 
eages and label the n leaves uniformly at random with {1, 2, ... , n} to obtain the re- 
constructed tree (there are n\2~'' possible labelings, where k is the number of cherries 
in the reconstructed tree shape). 

The interior vertices shall be ordered according to the time of speciation, this defines 
the rank function. To make the reconstructed tree oriented, for each interior vertex, we 
label the two daughter lineages with / and r uniformly at random, there are 2"~^ 
possibities. We then ignore the leaf labelings (note that each labeling of the oriented 
tree is equally likely, since each labeling of the reconstructed tree was equally likely). 

On the other hand, if we know the distribution on (ranked) oriented trees induced 
by the cBDP, we obtain the distribution on (ranked) reconstructed trees in the following 
way. We choose a labeling of the leaves with {1, 2, . . . ,n} uniformly at random from the 
n\ possible labelings. We then ignore the orientation. This gives us back the distribution 
on (ranked) reconstructed tree. Therefore, it is sufficient to determine the distribution on 
(ranked) oriented trees in order to determine the distribution on (ranked) reconstructed 
trees. Overall, let Tr be a reconstructed tree, and let Tq be a oriented tree which was 
induced by t^.. Then P[r,.] = P[ro]2"^^/n!, since a oriented tree has n\ possible labelings 
and for the n — 1 interior vertices, we have the distinction between the I and r daughter 
branches. 
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2 The point process 

In this section, we provide the density for the time of a speciation event in the re- 
constructed tree given n species today and the time of origin being at time tor in the 
past. We do that using a point process represent ation. The following point p r ocess has 



first been considered in connection with trees in 



mm . 



Aldous and Popovid (120051 ): 



Popovic 



Definition 2.1. A point process for n points and of age t^r is defined as follows. Draw 
the n points on the horizontal axis at 1, 2, . . . , n. Now pick n — 1 points to be at the 
location {i + 1/2, Sj), i = 1, 2, . . . , (n — 1); < Sj < tor- 

Lemma 2.2. We have a bijection between oriented trees of age tor and the point process 
of age tor- 

Proof. Draw the given oriented tree from the top to the bottom. At each speciation 
event, choose the branch with label r to be on the right and with label / on the left. 
On a horizontal axis, the leaves are located at position 1,2, .. .n. The speciation events 
are at the location (i + 1/2, Si), i = 1, 2, . . . , (n — 1); < Sj < tor- These are the n — 1 
points of the point process. The mapping to the point process is obviously injective and 
surjective, i.e. bijective. □ 

For completeness, we give the mapping from the point process to the oriented trees. 
Consider a realization of the point process. Connect the most recent speciation event 
with the two neighboring leaves. This speciation event replaces the two neighboring 
leaves in the leaf set. Continue in this way until all points are connected. This gives us 
the corresponding oriented tree. An example of the point process is given in Figure [2l 
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Figure 2: Reconstructed tree, oriented tree and the corresponding point process. The 
time of origin of the process is tor, the time of the most recent common ancestor is ti. 



Theorem 2.3. Each ranked oriented tree on n leaves induced by the constant rate 
birth and death process has equal probability. Note that this is true with or without 
conditioning on the time since origin. 

Proof. We first determine the probabihty of a ranked oriented tree with leaf labels. 
Consider the process backwards. We have n species today. We pick two species uniformly 
at random, which coalesce first. The one new branch gets label / and the other new 
branch gets label r. Overall there are n{n — 1) possible choices for the first coalescent 
event, each one being equally likely. The two chosen leaves are replaced with their 
common ancestor. We proceed in this way until all species are connected. There are 
n!(n— 1)! possible scenarios for the coalescent, each one being equally likely, i.e. a ranked 



oriented tree with leaf labels has probability 



1 



n!{n-l)! 



Each of the n\ possible labelings 



are equally likely, therefore the probability of a ranked oriented tree is 
the uniform distribution on ranked oriented trees of size n. 



(n-l)! 



This is 



□ 



Corollary 2.4. Each permutation of the n — 1 speciation points Si,...,s„_i in the 

point process of the birth-death process has equal probability. 

Proof. We have a bijection between the oriented trees and the point process (Lemma 
12. 2p . Choosing the n — 1 speciation points si, . . . , s^-i induces a ranked oriented tree. 
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let the probability of that tree be p. Now permute the n — 1 speciation points arbitrary. 
This induces a different ranked oriented tree. Since we have a uniform distribution on 
ranked oriented trees (Theorem 12.31) . the probability of the new tree is again p. So each 
permutation is equally likely. □ 

For obtaining the density of the speciation time Sj, we need the following results. 
Under a birth-death process, t he pr obability that a lineage leaves n descendants after 



time t is Pn{t)- From 



Kendalll (Il949l ). we know 



Po{t) 
Piit) 

Pn{t) 



Ml 



(A - /i)'e-(^-^)* 
(A-/ie-(^-'^)*)2' 

(A//i)"-Vi(t)h(t)]"-^ 



Let r be the oriented tree with n leaves and Xi > X2 > ■ ■ ■ > Xn-i the time of the 
speciation events. Note that the Xi,i = {1, 2, . . . , n — 1} is the order statistic of the 
Si, i = {1,2 n - 1|. 



In 



Rannala and Yang] (jl996l ). joint probabilities for xi, . . . , are given. The au- 



thors condition on the time t i , the time since the most r ecent common ancestor {mrca 



of the extant species. From 



Yang and Rannalal (119971 ). Equation (3), we obtain the 



density g of the ordered speciation times, X2 > x^ > . . . > Xn-i, given n and Xi = ti, 



n-l 



Pl[Xi) 



g{x2,X3, . . .,Xn\ti = t,n) = {n - 2)! TT/x- . 

t=2 PO(^) 

The variables X2, xs, . . . , x„ are the order statistic of say 82,83, ... , Sn-i. Each permu- 
tation of the n — 2 random variables S2, S3, ... , s„_i has equal probability (Corollary 
12. 4p . and therefore the density / of the speciation times is. 
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n— 1 



|, , N g{x2,X3,. . . ,Xn\tl =t,n) -p-r Pi[Si) 
f{S2, . . . ,Sn-l\tl =t,n) = — — = 11 /i- 



(n-2)! 



i=2 



which (by definition of independence) shows that the Sj are i.i.d., and therefore, 



f{si\ti = t,n) = fi 



Pl[Si) 

Po{t) 



(A -/if ' 



-{X-fi)si ^ _ ^^-{X-ii)t 

(A - /ie-(^-/^)*0^ l_e-(^-M)* ■ 



(2) 



Note that the expression for the density of Sj does not depend on n, we have the same 
distribution for any n. Therefore, we do not need to condition on n. For the distribution, 
we obtain by integrating Equation ([2]) w.r.t. Sj, 



F{s,\ti = t) 



(3) 



Note that the probabihties are conditioned on ti, the time of the mrca. It is of interest 
to condition on tor instead, the time since origin of the tree. We have the- maybe first 
seeming surprisingly- property that 



f{Si\ti =t) = f{Si\tor = t). 



(4) 



T 



To 



Figure 3: Reconstructed tree T with daughter trees 7j, 7^. We have mrca{T) 
origin{Ti) = origin{T2) = ti. 



The following argument verifies Equation (jlj). Suppose we have a tree where the 



mrca was at time ti. The daughter trees 7^, of the mrca have n, m extant species. 
The speciation times in 7^, 7^ occured according to Equation ([2]). On the other hand, 
since the two daughter trees of the mrca evolve independently, the tree 7^ can be 
regarded as a birth-death process which is conditioned to have n species today and the 
time of origin was tor = t- Therefore /(sj|ti = t) = /(sj|tor = t), see also FigureO With 
Remark 11.11 this establishes the following theorem. 

Theorem 2.5. The speciation times in a oriented tree (reconstructed 

tree) with n species conditioned on the age of the tree are i.i.d. The speciation times 
S2, . . . , Sn-i in a oriented tree (reconstructed tree) with n species conditioned on the mrca 
are i.i.d. The time s of a speciation event given (i) the time since the origin of the tree 
is tor, or (a) the time since the mrca is ti, has the following density and distribution, 



Since conditioning a tree to have the mrca at time t can be interpreted as condi- 
tioning the two daughter trees 7^ and 7^ of T to have the origin at time t, we will 
only condition on the origin of the tree in the following. 
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2.1 Special models 



2.1.1 The Yule model 

For the special case of a pure bhth process, i.e. = 0, which is the Yule model, Equation 
([2]) simplifies to 



F{s\t) 



Xe 



-As 



1 - e 
1 - e 



-xt 

-As 



1 - e 



-xt 



which has already been established in 
mrca though. 



Ned (I2OOII )- he conditioned on the time since the 



2.1.2 The conditioned critical branching process 

In a cCBP, we have A = /i. As /i ^ A, we get in the limit using Equation 
(jl]), and the property e"" ~ 1 — e for e ^ 0, 



and 



F{s\t) 



l + Xt 



(1 + As)2 t 
s 1 + Xt 
1 + Xs t ■ 



his 



l as already been established in a different way for A = 1 in lAldous and Popovic 



(120051) 



Popovid (120041 ). 



3 The time of origin 



Suppo s e nothing is kno wn about t, the time of origin of a tree. As in 



fl2005h : 



Aldous and Popovic 



Popovid (I2OO4I ). we then assume a uniform prior on (0, 00), i.e. a tree is equally 



11 



likely to origin at any point in time. Note that the prior does not integrate to 1. For 
any constant function, the integral is oo. Therefore the prior is not a density. Such a 



Bergeil fll980h . 



prior is called improper; a discussion and justification is found e.g. in 
Assuming the uniform prior, we will establish the density for t given n extant species. 
From Equation ([1]), we have the probability of n extant species given the time of origin 
is t, 

In order to derive the density for t given n, we need the following lemma. 

Lemma 3.1. Let For[n\t] be the probability that a tree has n extant species given the 
time of origin t. We have 

/ For[n\t]dt = — . 
Jo nX 



Proof. The derivative of 



]^_g-(A-n)t 
A-Ate-(^-^)* 



is, using the quotient rule. 



d f 1- e-(^-^)* 
di \\^^Jip(^^ 



'1 _ e-(A-M)t) 



n 



and therefore, 



For[n\t]dt 



X' 



n-l 



n 



1 -e 



n \ A" / \n 



J 



which establishes the lemma. 



□ 



Theorem 3.2. We assume the uniform prior on (0, oo) for the time of origin of a tree. 
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Conditioning the tree on having n species today, the time of origin has density function 



^[1 — g-(^-M)*^"-ig-(^"M)* 



qoMn) = n\^{\ - /i) _ • (5) 



Proof. With Bayes' law, we have 

^or[n\t]qor{t) ¥or[n\t]qor{t) 



qor{t\n) 



Kr[n\t]qorit) Kr[n\t] 



/~ For [n 1 1] qor {t) dt For [u \ t] dt 

^ XnFor[n\t] 

n _ g-(A-/i)t'\n-lg-(A-^)t 

= nA"(A - iJ,)'^- — -T j{ — r--— . 



□ 



Corollary 3.3. The distibution for the time of origin given n species today is 



Qarit\n) 



A - ^e-(^-^)* 



Proof. We have limt^oo Qor{t\n) = 1. Differentiation of Qor{t\n) w.r.t. t yields 

iQoritln) = n)^{X-^^ ^~l;^;X^:r^ 



-Qor{t\n) = ny\X-iif ^^ ^ '"'^^'^X^jMn+i = (lor{t\n) which completes the proof. □ 



4 The time of speciation events 



In this section, we calculate the density for the time of the k-ih speciation event given 
we have n species today. Knowing that the distribution on ranked reconsructed trees 
is uniform (Theorem 12.31) . this characterizes the reconstructed trees completely. These 



results allow us to ca^ 



tree (IGernhard et al. 



culate the density for t 



2006 



Gernhardl . 



le time of a given vertex in a reconstructed 



20071). 
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4.1 Known age of the tree 

Let A^ t be the time of the k-th speciation event in a reconstructed tree A with n 
extant species and age t. The n — 1 speciation events in A are i.i.d. and have the 
density function f{s\t), see Theo rem 12 .51 The density of A^ t is therefore the {n — k)-th. 



order statistic, which is (see e.g. 



Dehhng and Haupt 



fl2003h . Theorem 9.17), 



f^.Js) = {n-k)(^^_^^F{s\tr-'-\l-F{s\t)r''f{s\t) (6) 
for s < t and fj^k ^(s) = else. The distribution function of A^^^ is 

i=0 ^ * / 



for s < t and FAk (s) = 1 else. 



4.2 Unknown age of the tree 

If the time of origin is unknown, we assume a uniform prior for the time of origin. 
Using this assumption, we will calculate the density function for A'^, the time of the 
k-th speciation event in a tree with n extant species. 

Theorem 4.1. Let A^ he the time of the k-th speciation event in a tree with n extant 
species. We have for < /i < A, 



fM = {k + i] 



k + lj ^ (A - /ie-(^-A')^)"+i ■ 



Proof. For a fixed time t of origin, we have the density f_^k ^ for the time of the fc-th 
speciation event (Section 14. ip . With our uniform prior, the time of origin has density 
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function qor{t\n). The density /^fc(s) is therefore 



fAtM)^orit\n)dt 



(A - /ie-(^-'')'')"(l - e-(^-A')*)"-i 
nA"(A - ^^f^-^ / dt 



(A - ^e-(^-'^)*) 



n+l 



n — 1 



(A - /i)'=+3A" 



3 — (A— /i)A:s ^ g— (A— /i)s^n— fc— 1 



k )"■" " (A - /ie-(^-^)'*)" 

°o (1 _ g-(A-/j){t-s)^fc-lg-(A-Ai)t 



X 



(A - /ie-(^-'^)*) 



fc+i 



-dt 



nk 



n — 1 



(A-/i)'=+3A" 



A;(A-^)(A-/ie-(^-^)«) 



-(A— /^)A;s ^-j^ g— (A— /j)s ^n— fc— 1 

(A - /ie-(^-^)'*)" 
X _ g-(A-/i){t-s) 

A - /ie-(^-'^)* 



X 



(A; + 1) ( / J A"-^(A - /x)^+^e-(^-^)(^+^)- { ^ 



which estabhshs the theorem. 



□ 



Remark 4.2. Under the Yule model, i.e. setting /i = and A arbitrary in Theorem 

^gAs -j^^n— A,'— 1 



14.11 we have, 



fA',.is) = {k+l)[^-^^]X- 



n 



^Xsn 



which has been established in 



GernhardI (120071 ) for A = 1 in a different way. 



In the cCBP, the birth rate equals the death rate, X = fi. Taking the limit fi —>■ X 
in /_4fc,we obtain from Theorem 14. II using the property e"*^ ~ 1 — e. 



/^.(.) = (A; + 1) 



n 



^n—k—l 



A 



n—k 



k + lj (l + As)"+i 
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which has been estabhshed in iGernhardI (120071 ) for A = 1 in a direct way. 



5 Expected speciation times 

In this section, we calculate the expected time of the k-th speciation event in a recon- 
structed tree with n species analytically. Our Python implementation for dating trees 
uses the analytic results. Higher moments are calculated numerically. 

5.1 Known age of the tree 

Theorem 5.1. The expectation of A'^^ is, for < fi < X, 



eK,,1 



i=0 j=0 ^ 

su) + E E 

1=1 m=0 



1 - e-(^-A')* 



X 



n-j-l\ //-I 
/ I \ m 



\l+m 



X 



l—l—m 



(A - 



jh{j,m) 



where 



X 



(A - /i)A"-J-i 



In 



A — 



'n-j-2 . ■ o\ r 

^ (n- ] - 2\i/ 

m=l 



m I m 



/i)--(A-/i)-) 



and 



h{j, m) 



In^ 



-(A-M)t 



x-^J. ' 

(A— ;^e~(-^~'')')'"+J+2~" — (A— 
m+i+2— n 



ifm + j + 1 — n 
else. 



-I, 
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For yU = 0, we have, 



i=0 j=0 



x{k + i-jy 



(1 - e-^*)i-"(e-^'^* ~{{k + i- j)\t + l)e-('=+')^*). 



For 11 = \, we have 



fc-l i 



'-EE 



i=0 j=0 



n~l\ fi\ {~iy+^ fl + \t 



n-j-l 

\t-{n-]- 1) ln(l + At) + ^ 



n-j-l 



1=2 



Z ^ 1-/ 



The proof is found in the appendix. 



5.2 Unknown age of the tree 

A closed form solution for the first and second moment (for all k) of unde r the Yule 



and fo r all moments under the cCBP (with the setting A = 1) is given in 



Gernhard 



fl2007h . 



n 

Ef 

i=k+l 

n ^ n ^ \ 

i=A:+l i=k+l j=k+l 

z' Ti — fc + m — IN 

-YY^ if/i;>m, 
else. 



(9) 



(10) 



oo 



For general A, /i, we have the following analytic expression for the expectation (the 
proof is found in the appendix). 
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Theorem 5.2. For < fi < X, the moments of are (p := p/\), 



eK] 



k + l f n 



X \k + l 
1 



n—k—l 



n-k-l\ 1 /I 
i J{k + i + l)p\p 



k+i 



log 



1-p 



1 



For p = we have 



and for p = X we have 



n 



«=fc+i 



n — k 
Xk ' 



In particular, the expectations basically only depend on p. Different X just scale time by 
1/X. 

Knowing the expected time of the fc-th s peciation event aUows us to draw a lineage- 



through-time (LTT) plot (iNee et al. 



19941 ) analytically. In a LTT plot, the time vs. 



the number of species at that time (on a logarithmic scale) is drawn. These plots are 
frequently used for comparing the data with a model. Commonly, the LTT plots for 
different models are obtained via simulations. Since we know the expected time of the 
speciation events in our model analytically, we can plot the LTT plot analytically, see 
Figur m 
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2.5 




0.5 - 



Q I I I \ \ \ \ \ I I I 

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

time (scaled) 

Figure 4: Lineage through time plot for n = 10 species. Time is scaled such that the 
mrca is at and today is 1. We have p = 0,0.25,0.5,0.75,0.9, 1 from top to bottom. 
Note that for variing A, time is scaled by 1/A (compared to A = 1). That means, since 
we scale time, the plots are the same for any A. 

6 Properties of the speciation times 
6.1 More on the point process of cBDP 

In Section [2], we showed that a reconstructed tree of a cBDP of age t can be interpreted 
as a point process on — 1 points which are i.i.d. We will see in this section, that the 
same is not true if we do not condition on the age of the tree but assume a uniform 
prior. 

From Theorem 12.51 we obtain the density function for x = (xi, . . . , the order 
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statistic of the speciation times, conditioned on the time of origin, t. 



n-1 



f{x\t,n) = {n-l)\l[ 



With the uniform prior on the time of origin, we obtain the density for x given we have 
n extant species, f{x\n), 



f{x\n) 



n 



f{x\t,n)qor{t\n)dt 

/"III (X- n)2 ~{X-ti)x, 

!A"(A-/x)Mn 



-(A-M)t 



g-{A-/i)xi 



1 



-MA-/i)(A-/ie-(^-'^)*)J,^ 



= n!A"-^(A-^)— X 1 ^ ^2 - 

If the n — 1 speciation points would be i.i.d. with density function g, we would have 
f{x\n) = {n — 1)! g{xi\n). Such a function (7 does not exist due to the Xi, i.e. the 
Si are not i.i.d. However, since each permutation of the Sj is equally likely (Corollary 
12. 4p . the Si are distributed identical. 

If we condition on the time of the mrca, Xi, we again have independent points, as 
stated in Theorem l2.5l (without using Theorem [231 ^6 could also show the independence 
by calculating the density of x conditioning on n,Xi as f{x\n,Xi) = jl^l^l^^-j = Y^f^^)- 
For /i = 0, i.e. for the Yule model, we can establish the same result. 



f{x\n) 



/■oo " 1 roo 

/ forix\t,n)qor{t\n)dt = n\X''Y[e-^''' / e-^'dt 

Jxi j_2^ Jxi 



n-1 



Xxi 



i=l 
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i.e. the Si are not independent. Conditioning on xi, we again have independent points, 
as stated in Theorem 12. 5[ 



Remark 6.1. Let us now consider the joint probabihty of the n — 1 speciation events 
and the time of origin, f{x,t\n). With t := xq, we have. 



i.e. Xo, Xi, . . . Xn-i is the order statistic of n i.i.d. random variables. 
6.2 The point process of the coalescent 

The coalescent is the standard neutral model for population genetics. The n individuals 
in a population are assumed to coalesce as follows. For the most recent coalescent event, 
pick two of the n individuals uniformly at random, the time between today and their 
coalescent is distributed exponential (rate (g)-^) where A is the rate of coalescent. We 
will show that this process - even though it is very similar to the Yule process - does 
not have a point process representation with i.i.d. coalescent points. 

Let X = (xi,X2, . . . ,Xn-i) be the order statistic of the coalescent times (with xi > 
, X2 . . . > Xn-i)- Note that Xi — Xj+i is distributed exponential with rate (*2^)A. The 
density function for x is therefore. 



n-1 



(A - ;u)2e-(^^^)^» 



f{Xo,Xi, . ..Xn-i\n 







a('+i)(x,-x,+i) 




n-1 



1=1 
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Conditioning on the time of the most recent common ancestor, Xi, we get, 



f{x\n,xi) 



f{x\n) n\{n — 1)! 



n-1 



f{xi\n) f{xi\n)2''- 



-Xixi 



1=1 



n-1 

h{xi,n) Y\ Ae~ 

i=2 



Xixi 



where h is a function only depending on Xi,n. If the n — 2 coalescent points would 
be i.i.d. with density function g, we would have f{x\n,xi) = {n — 2)\YYi-2g{.Xi,xi,n). 
However, due to the i in e~^*^% this property is not satisfied, therefore the n — 2 points 
are not i.i. d. However, in the coalescent, also each ranked oriented tree shape is equally 



likely (see lAldoud (120011 ) or argue as in Theorem 12.31) . therefore each permutation of 
the Si has the same probability. That means that the Si are identical distributed- but 
not independent. 



6.3 Backwards process of a cBDP 

In the birth-death process, extant species speciate and die with exponential waiting 
times. However, we condition the process to obtain a reconstructed tree with n extant 
species today. We will describe the backward process, i.e. determine the waiting time 
until the extant species coalesce in the reconstructed tree. 

Theorem 6.2. Under the conditioned birth death process, a pair of species coalesces 
according to density function f{s\t) from Theorem \2.5l 

Proof. Consider a fixed pair of species out of the n species. Obviously, we can put 
them next to each other on the x-axis of the point process, at location {i,i + 1). Their 
coalescent point is {i + 1/2, Si), see Theorem 12.51 The time has the distribution with 
density function f{s\t) from Theorem 12.51 □ 

In a reconstructed tree with n species, the time to the last speciation event, i.e. 
the time between the (n — l)-st speciation event and today is A^'t'^, -A^'^- The time 
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between the k-th and the {k + l)-st speciation event can be calculated as follows. First 
note that since the n — 1 points in the point process are i.i.d. with density function 
the density function g of point ji being at time Sj^ and j2 being at time Sj^ is, 

g{sj^,Sj^\t) = f{sj,\t)f{sj,\t). 

Assume the A;-th speciation event is at time r and the {k + l)-st speciation event is at 
time r — s. We have n — 1 possibilities choosing the point for the k-th speciation event 
from the n — 1 points, and n — 2 possibilities to choose the point for the {k + l)-st 
speciation event. The density function for having a speciation event at time r and t — s 
is therefore (n — l)(n — 2)/(r|t)/(r — s\t). The probability for /c — 1 speciation points 
of the remaining n — 3 speciation points being earher than r is {^Zfji^ ~ F{t))'^~^. 
The probability that the remaining n — k — 2 speciation points happend after r — s is 
F(r - sY'^-^. Overall, 

= j\ri-l){n-2) {^^ ^) {l-F{T\t)f-'F{T-s\tr-'^-'f{T\t)f{T-s\t)dT. 

The time between the k-th and Z-th speciation event {k < in a tree of age t can 
be obtained in the same way. In addition to above, we require I — k — 1 points to be 
between r and t — s, 

f<.-<.M - /V-i)(n-2)(^:j)(;:*:^^)(i-FM«))'-x 

(F(T\t) - F(t - s\t])'-'-'F{T - 5|i)"-'-V(T|t)/(T - s\t)dT. 

Note that An~t^ — A^^i is the time until a coalescent event for k species in the 
reconstructed tree. We found analytic solutions for the above integrals under the Yule 
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model (see next Section). For yU 7^ 0, the densities can be derived with numerical 
integration in the PhyloTree package. If assuming a uniform prior for the time of origin, 
we additionally need to integrate the above densities over t, weighted by qor{t\n). 



6.4 Backwards process of the Yule model 
6.4.1 Known tree age 

For the Yule model, we can calculate the time between any two speciation events in 
the reconstructed tree analytically. The time between the k-th speciation event and the 



l-th speciation event, / > k, given the time between the n-th speciat 



i on eve nt (today) 



Gernhard et al. 



(120071) for A 



and the origin of the tree is t has been calculated in 
which yields for general A to 

i=0 j=0 ^ ' 

with 5., = k{k + 1)(,|JC^7^)C=-^)C^T')^"£SSF- Note that A^^f - A^, is the 
time until a coalescent event of k extant species. 

Analogous results are not straightforward to obtain in a process with extinction. 
However, the expectation can be calculated straightforward for the cBDP, Ef^^^ — 
<J=EK,]-EK,t]. 

6.4.2 Unknown tree age 

We assume an uniform prior for the time of origin of a tree- a first ancestor species 
was created at any point in the past with equal probability. Since we want to obtain n 
species today, the time of origin has to be conditioned to see n species today. 

For the pure birth process, this is equivalent to start growing a tree and wait until the 
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tree has n species. After the {k — l)-th speciation event, we always have an exponential 
(rate k) waiting time. Therefore, also the coalescent has an exponential (rate k) waiting 
time. This is not true in a process with extinction. 

It remains to consider the time between the {n — l)-st speciation event and today. 
Note that today is not defined as the n-th speciation event, but whenever we look at 
the process. The density for the time between the (n — l)-st speciation event and today 
is 

/^.-i(.)=nAe-^- 

which is the exponential (rate n) distribution. Therefore, looking at the tree today 
is equivalent to looking at the t ree at the time of the n-th speciation event. This 
observation has been discussed in 



Hartmann et al. 



In 



Gernhard et al. 



(120071 ) 



(120061 ). the time between the fc-th and the /-th speciation event. 



/ > fc, is established for A = 1 which yields for general A to 



/a-A, (^) = A(fc + 1) ( ^ I ^ ] e-'^ - l)'-'^- 



For the general birth-death process, the waiting times cannot be calculated straight- 
forward, since the time to the n-th speciation event differs from the time until today 
(note that we might have the n-th speciation event before today - i.e. the n-th speci- 
ation event is followed by extinction). However, obtaining the expectation for is 
straightforward, ¥.[Ai - Al] = E[Al] - E[Al]. 



7 Applications 

Knowing the density and expectation of the k-th speciation time given we have n species 
today, we can obtain the density and expectation for the time of each interior node of a 
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given tree. This can be used for dating phylogenies, if only the shape is inferred- missing 
dates in phylogenies could be due to s upertree methods, morphological data o r absence 



of a molecular clock. In earlier work (iGernhard et al 



2006 



Gernhard 



20071 ). we gave 



the method and computer programs for dating phylogenetic trees. So far though, we 
only knew the speciation times for the Yule model and the cCBP. With the results in 
this paper, we can date a phylogeny assuming any constant rate birth-death model. 
The methods are implemented in our PhyloTree package for python. 

The point process representation is useful for simulating reconstructed trees on n 
taxa. If we have no extinction, we can simulate until obtaining n species. More precise, 
we stop at the n+l-st speciation event, since that is the same as stopping today (Section 
I6.4.2p . However, with extinction, simulations are tricky - we could return to n species 
again and again. With the point process, it is easy to sample trees on n species. First 
sample a time of origin according to the density QoriA^) Equation ([5]). Then sample 
n — 1 speciation times according to the density f{s\t) in Theorem 12.51 

If we want to simulate reconstructed trees on n taxa and age t, direct simulation of 
the process seems almost impossible - we simulate until t and only keep the realization 
if we see n species. We will throw away a lot of realizations (always if we do not see 
n species), therefore the time amount until we have a reasonable size of samples is 
huge. With the point process, we simply sample n — 1 speciation times according to 



the density f{s\t) in Theorem 12.51 Th ese sampling met 



methods will be discussed in detail in 



lodsand more general sampling 



Hartmann et al. 



mm . 



8 Results and Outlook 



The n — 1 speciation points in the point process representation are i.i.d. if conditioning 
on the time of origin or the most recent common ancestor. As discussed, this allows us 
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to calculate the speciation times in a reconstructed phylogeny. So far, we calculate the 
speciation time with only conditioning on the shape of the phylogeny. With the point 
process, one might be able to condition on the shape as well as on some known dates 
in the phylogeny. This would be valuable for dating supertrees, since some speciation 
times are usually known. 

In the application section, we showed that simulations of reconstructed trees become 
easy using the point process. This becomes useful for comparing the model with the 
data on aspects where no analytical results are known. 
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A Proofs 



Proof of Theorem 
been calculated in 



5.1[ Under the Yule m odel, i.e. = 0, the expectation of 



Gernhard et al.i (120071 ) for A = 1, 



E[<,(A = 1)] = 



i=0 j=0 

(1 - e'*)i""(e-^* -{{k + i- j)t + l)e-('^+')*). 



For general A, since 



we have 



Substituting x = \s yields 



kfn-l 
A k 
E[<A.(A = 1)] 
A 



-x\n—k—l 



(1-e- 



-Ann-l 
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For < /X < A, we have, 



Jo ' ' Jo ' 



i=0 j=0 



n — l\/A/n — ? — 1 

i=0 j=0 1=0 ^ / \J/ \ 

as. 



1 _ e-(A-/^)t y io (A - /xe-(^-'^)-)''"^"' 
With the substitution a; = A — fie'^^'^^'^ , we obtain for / > 0, 



/o (A - //e-(^-/^)^)"-^-' /^(A-/x)A-M 



(A - 



m.=0 \ ^ -'A-M 



m=0 



(A - 



where 



m) = 



(A-/^e~(^~»')*)"'+^+^~"-(A-/j)"''+J+^~" gigg 
m+j+2-n 
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For Z = 0, we have with the substitution x = Ae*^^ — fj,, 



1 



{X - |J,e-(^~^'>) 
(Ae(-^-'^)* - 



n~j-l 



ds 



(A - /i)A 
1 



X 



-dx 



(A - /i)A"-J- 
1 



m=0 
X 



(A - /i)A"-J-i 



In 



A — /i 



n-i-2 

E 

m=l 



m I m 



So overaU, 



Z=l m=0 



For /i = A, we obtain from Equation ffTTl) . 



^ fn-l\ fi\ , f I + \t 



i=0 j=0 



;-ir 



n-j-l pt 



+ 



cis. 
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Substituting a; = 1 + As, we get, 



t / 5 \ n-j-l 



\1 + As 

1 f'^^Wx-l 



ds 



1=0 
1 



X 



n-3-l\ (-1) 



dx 



3y diOC 



\n-j 



\t-{n-j- 1) ln(l + \t)+ 

1=2 



n-j-1 
I 



-1) 



/(l + At)-'+^-l 
l-l 



Overall, this is 



fc-i 



i=0 j=Q 



i J \jj A" -J' V t 



n-j-1 



i=2 



1-/ 



□ 



Proof of Theorem \5.S[ For /i = and for jj, = X, the expectation is established with 
Remark [4.21 and Equations ([8]) and (fTOj) . For /x 7^ and /x 7^ A we have with Theorem 



/I _ -(A-^)s\n-fc-l 

^ V^ + V (A - /xe-(^-'^)'^)"+i 



Set 



n 



(A - /xe-(^-^')^)"+i ■ 
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Therefore, 

noo poo 

n^n] = / f{s)sds = Ci[F(s)s]~ - Ci / F{s)ds 
Jo Jo 

where F{s) :— J f{s)ds. 

In the following, we calculate F{s). We do the following substitution: 



a; = -: ^ = —-r. tttt e 



A-//e-(^-'*> ds (A - //e-(^-'*)^)2 1 + //x 

This yields 



n— fc— 1 

/ /JO I \ I 

x'^'^^dx 

\ i / ' ' ' " t 

i=0 

n-k-1 , , -,\ / / _rA-//.V<i \ 



In— fe / J 



1=0 ^ ' ^ 



We have hms_,oo F{s)s — and F(0) -0 = and therefore, 

/■oo 

E[^^] = -Ci / F(s)cis. 

^0 
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Substitute x = \ — fie ^'^ 



F2{s) / F{s)ds 



n— fc— 1 



1 '^^^ fn-k-i\{-{\- ^i)y-^ 



Xn-k 

1 



i=0 
n—k—1 

E 

j=0 



-(A-At)s 



/c + i + 1 ./n V A - iie-^^-f'^ 
n- A; - 1\ (-(A-At))^-i 



^ n—k—1 

1 \ ^ / n 



E 

i=0 



i + i Jx- 



\—x 



^(A - ii)x''+'+^ 



T—dx 



Evaluating the integral yields 



F2{s) = 



A^ 



n—k 



n—k—1 

E 



n — k — \\ (A — /i) 



\i-2 



k + i + l /i'^+^+i 



X 



Therefore, with p := ///A, 



/ s n— fc— 1 / 



log 



Ji + l 
A 



A — // 

A; + 1 / n 



A \k + l 
1 



fc+i 



n-k-l\{X- 1 

i y A; + i + 1 /x'^+^+i ^ 

A 



■D' E ( 

j=0 ^ 



X — ji 

1 



1 



n — A; — 1 

I I {k + i + l)p \p 



k+i 



-- 1 X 



3 



which establishes the theorem. 
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